ANALYTICAL AVENGERS

INFO 523 - Project Final

Authors: Analytical Avengers (Melika Akbarsharifi, Divya Liladhar Dhole, Mohammad Ali Farmani, H M Abdul Fattah, Gabriel Gedaliah Geffen, Tanya George, Sunday Usman)

Affiliation: School of Information, University of Arizona

Abstract

This study investigates the relationship between age demographics and severe crashes in Massachusetts and develops a predictive model intended to support road safety. Using a crash dataset from January 2024, we examine how age correlates with crash severity alongside environmental factors: lighting, weather, road conditions, speed limits, and the number of vehicles involved. The analysis indicates which age groups, among both drivers and vulnerable users, are at greater risk of severe crashes, and which environmental conditions contribute to the likelihood and severity of crashes. To classify crash severity, we trained logistic regression, decision tree, random forest, and K-Nearest Neighbors (KNN) classifiers. On the held-out test set, all four reached accuracies between 99.98% and 100%, with F1-scores above 0.999. One limitation is the absence of road volume or vehicle miles traveled data, which prevents us from putting crash frequency in the context of exposure. The outcomes offer tools for policymakers and practitioners who want to plan safety measures and allocate resources proactively. By predicting crash risk from age demographics and environmental conditions, authorities can target interventions at the groups and conditions most associated with severe crashes. The study contributes to a data-driven approach to road safety and traffic management.

Introduction

Understanding the factors contributing to severe car crashes is crucial for improving road safety and reducing traffic-related injuries and fatalities. This project aims to develop a predictive model that correlates age demographics with severe crashes in Massachusetts. The ultimate goal is to identify key risk factors and provide data-driven insights for implementing effective safety measures.

We analyze a dataset of car crashes from January 2024, collected from the Massachusetts Registry of Motor Vehicles. The dataset has 72 columns covering crash characteristics, driver demographics, environmental conditions, and vehicle information. By examining these variables, we seek to uncover patterns that link age with severe crashes and to identify potential high-risk groups and circumstances.

Our analysis addresses two research questions. The first asks which age groups are most at risk for severe crashes and how environmental factors such as lighting, weather, road conditions, and speed limits contribute. The second asks whether a predictive model can classify crash severity from these variables. For the second question, we trained several binary classifiers: logistic regression, a decision tree, a random forest, and KNN.

The methodology involved several steps. First, we pre-processed the dataset to handle missing data, encode categorical variables, and scale numerical features. Next, we conducted exploratory data analysis to identify correlations and patterns. To predict crash severity, we trained logistic regression, decision tree, random forest, and KNN classifiers on a training subset and evaluated them on a separate test set. For each model we measured accuracy, precision, recall, and F1-score. The high scores on the test set indicate potential for real-world application in road safety.

This report details our approach to analyzing the Massachusetts crash dataset, including the steps taken to process the data, build the predictive model, and evaluate its performance. We discuss our findings and provide insights into which age groups are most at risk, along with the environmental factors that contribute to severe crashes. Through this work, we aim to contribute to road safety practices and provide useful information for policymakers, traffic safety professionals, and other stakeholders interested in reducing traffic-related incidents and enhancing public safety.

Questions

  1. Which age groups are at the highest risk of severe crashes, and how do lighting, weather, road conditions, speed limits, and the number of vehicles involved affect the risk for each age group?
  2. Can a model accurately classify crash severity from the factors identified in Question 1?

Analysis Plan

Distinct data types of features in crash_data:
[dtype('O') dtype('float64')]
First few rows of the crash data: 
   Crash Number City Town Name  Crash Date  \
0      5342297         LOWELL  01/01/2024   
1      5342292         LOWELL  01/01/2024   
2      5342292         LOWELL  01/01/2024   
3      5342292         LOWELL  01/01/2024   
4      5342292         LOWELL  01/01/2024   

                        Crash Severity Crash Status Crash Time  Crash Year  \
0                     Non-fatal injury         Open    3:26 AM      2024.0   
1  Property damage only (none injured)         Open   12:48 AM      2024.0   
2  Property damage only (none injured)         Open   12:48 AM      2024.0   
3  Property damage only (none injured)         Open   12:48 AM      2024.0   
4  Property damage only (none injured)         Open   12:48 AM      2024.0   

  Max Injury Severity Reported  Number of Vehicles Police Agency Type  ...  \
0          Possible Injury (C)                 1.0       Local police  ...   
1       No Apparent Injury (O)                 2.0       Local police  ...   
2       No Apparent Injury (O)                 2.0       Local police  ...   
3       No Apparent Injury (O)                 2.0       Local police  ...   
4       No Apparent Injury (O)                 2.0       Local police  ...   

    X   Y Latitude Longitude Vehicle Unit Number Vehicle Make Vehicle Model  \
0 NaN NaN      NaN       NaN                 1.0         HOND          HR-V   
1 NaN NaN      NaN       NaN                 1.0         NISS        ALTIMA   
2 NaN NaN      NaN       NaN                 2.0         HOND        ACCORD   
3 NaN NaN      NaN       NaN                 2.0         HOND        ACCORD   
4 NaN NaN      NaN       NaN                 2.0         HOND        ACCORD   

  Person Number   Age         Sex  
0           1.0  32.0  F - Female  
1           1.0  60.0    M - Male  
2           2.0   NaN         NaN  
3           3.0  31.0    M - Male  
4           4.0   NaN    M - Male  

[5 rows x 72 columns]

Question 1

Summary statistics for numerical features:

       Crash Year  Number of Vehicles  MassDOT District  Total Fatalities  \
count     25547.0        25547.000000      25547.000000      25547.000000   
mean       2024.0            1.976749          4.019063          0.003562   
std           0.0            0.702530          1.325421          0.068730   
min        2024.0            1.000000          1.000000          0.000000   
25%        2024.0            2.000000          3.000000          0.000000   
50%        2024.0            2.000000          4.000000          0.000000   
75%        2024.0            2.000000          5.000000          0.000000   
max        2024.0            9.000000          6.000000          3.000000   

       Total Non-Fatal Injuries   Speed Limit              X              Y  \
count              25547.000000  23389.000000   21002.000000   21002.000000   
mean                   0.318824     34.394502  205930.128516  887470.383156   
std                    0.728140     12.979679   49539.383540   31782.135543   
min                    0.000000      1.000000   44708.708525  779050.104521   
25%                    0.000000     25.000000  179154.370652  870946.937400   
50%                    0.000000     30.000000  224092.943601  889548.926635   
75%                    0.000000     40.000000  237299.607076  908937.437400   
max                    8.000000     65.000000  327948.082270  958417.191000   

           Latitude     Longitude  Vehicle Unit Number  Person Number  \
count  20823.000000  20823.000000         25220.000000   25547.000000   
mean      42.234940    -71.431249             1.489968       1.918699   
std        0.287058      0.600959             0.637851       1.568750   
min       41.251611    -73.386241             1.000000       1.000000   
25%       42.086592    -71.756001             1.000000       1.000000   
50%       42.254041    -71.209095             1.000000       2.000000   
75%       42.428108    -71.049485             2.000000       2.000000   
max       42.874973    -69.962834             9.000000      42.000000   

                Age  
count  23002.000000  
mean      38.952265  
std       18.503512  
min        0.000000  
25%       24.000000  
50%       36.000000  
75%       53.000000  
max       99.000000  
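
Missing values in the Question 1 columns before cleaning:
Age                       2548
Light Conditions             3
Weather Conditions           3
Road Surface Condition       3
dtype: int64

Missing values in the Question 1 columns after cleaning:
Age                       0
Light Conditions          0
Weather Conditions        0
Road Surface Condition    0
dtype: int64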
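
The counts above show the four Question 1 columns going from some missing values to none. The report does not state how the gaps were filled, so the following is only a minimal sketch of one common approach: median imputation for `Age` and most-frequent-value imputation for the condition columns. The toy rows and the choice of imputation strategy are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for crash_data; the values are illustrative only.
crash_data = pd.DataFrame({
    "Age": [32.0, 60.0, np.nan, 31.0, np.nan],
    "Light Conditions": ["Daylight", None, "Daylight",
                         "Dark - lighted roadway", "Daylight"],
})
cols = ["Age", "Light Conditions"]

# Missing values per column before cleaning.
print(crash_data[cols].isnull().sum())

# Assumed strategy: median for the numeric column ...
crash_data["Age"] = crash_data["Age"].fillna(crash_data["Age"].median())
# ... and the most frequent category for the categorical column.
most_common = crash_data["Light Conditions"].mode().iloc[0]
crash_data["Light Conditions"] = crash_data["Light Conditions"].fillna(most_common)

# Missing values per column after cleaning.
print(crash_data[cols].isnull().sum())
```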

Figure: Crash severity by age group

Figure: Crash severity by light conditions

Figure: Crash severity by weather conditions

Figure: Crash severity by road surface condition

Figure: Number of crashes by age group and light conditions
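
The figures in this section compare crash severity across age groups and conditions. As a rough sketch of the tabulation that could sit behind such a plot, the example below computes the share of each severity level within each age group using `pd.crosstab`. The toy rows are invented for illustration and are not taken from the crash dataset.

```python
import pandas as pd

# Toy stand-in for the cleaned crash data; the rows are illustrative only.
crashes = pd.DataFrame({
    "Age Group": ["25-35", "25-35", "35-50",
                  "60 and above", "60 and above", "60 and above"],
    "Crash Severity": [
        "Non-fatal injury",
        "Property damage only (none injured)",
        "Property damage only (none injured)",
        "Non-fatal injury",
        "Non-fatal injury",
        "Property damage only (none injured)",
    ],
})

# Row-normalized table: share of each severity level within an age group.
severity_share = pd.crosstab(
    crashes["Age Group"], crashes["Crash Severity"], normalize="index"
)
print(severity_share.round(2))
```

Normalizing by row makes age groups of different sizes comparable; swapping `"Age Group"` for a condition column such as `"Light Conditions"` gives the corresponding table for that factor.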

Question 2

Analysis of Missing Values for numerical features: 

                           Missing Values  Percentage (%)
Crash Year                             3        0.012189
Number of Vehicles                     3        0.012189
MassDOT District                       3        0.012189
Total Fatalities                       3        0.012189
Total Non-Fatal Injuries               3        0.012189
Speed Limit                         1984        8.060781
X                                   4442       18.047373
Y                                   4442       18.047373
Latitude                            4612       18.738065
Longitude                           4612       18.738065
Vehicle Unit Number                  326        1.324503
Person Number                          3        0.012189
Age                                    0        0.000000
feature_variable                       0        0.000000 


Analysis of Missing Values for categorical features: 

                                                     Missing Values  Percentage (%)
Crash Number                                                      0        0.000000
City Town Name                                                    0        0.000000
Crash Date                                                        3        0.012189
Crash Status                                                      3        0.012189
Crash Time                                                        3        0.012189
Max Injury Severity Reported                                      3        0.012189
Police Agency Type                                                3        0.012189
State Police Troop                                            19852       80.656564
Age of Driver - Youngest Known                                  489        1.986755
Age of Driver - Oldest Known                                    487        1.978629
Age of Vulnerable User - Youngest Known                       23848       96.891886
Age of Vulnerable User - Oldest Known                         23848       96.891886
Crash Hour                                                        3        0.012189
Driver Contributing Circumstances (All Drivers)                 668        2.714013
Driver Distracted By (All Vehicles)                            4628       18.803072
First Harmful Event                                               3        0.012189
Is Geocoded                                                       3        0.012189
Light Conditions                                                  0        0.000000
Manner of Collision                                               3        0.012189
Vulnerable User Action (All Persons)                          23921       97.188478
Vulnerable User Location (All Persons)                        23921       97.188478
Vulnerable User Type (All Persons)                            23864       96.956893
RMV Document Numbers                                             70        0.284403
Road Surface Condition                                            0        0.000000
Roadway Junction Type                                             3        0.012189
RPA Abbreviation                                                  3        0.012189
Traffic Control Device Type                                       3        0.012189
Trafficway Description                                            3        0.012189
Vehicle Actions Prior to Crash (All Vehicles)                     3        0.012189
Vehicle Configuration (All Vehicles)                             38        0.154390
Vehicle Emergency Use (All Vehicles)                            661        2.685573
Vehicle Towed From Scene (All Vehicles)                          78        0.316906
Vehicle Travel Directions (All Vehicles)                          3        0.012189
Weather Conditions                                                0        0.000000
County Name                                                       3        0.012189
Crash Report IDs                                                  3        0.012189
FMCSA Reportable (All Vehicles)                                   3        0.012189
FMCSA Reportable (Crash)                                          3        0.012189
First Harmful Event Location                                      3        0.012189
Geocoding Method                                               4442       18.047373
Hit and Run                                                       3        0.012189
Locality                                                      24597       99.934994
Most Harmful Event (All Vehicles)                               301        1.222931
Road Contributing Circumstance                                16861       68.504449
School Bus Related                                                3        0.012189
Traffic Control Device Function                                   3        0.012189
Vehicle Sequence of Events (All Vehicles)                       235        0.954780
Work Zone Related                                                 3        0.012189
Vulnerable User Sequence of Events (All Persons)              24594       99.922805
Vulnerable User Distracted By (All Persons)                   24598       99.939057
Vulnerable User Traffic Control Type (All persons)            24596       99.930931
Vulnerable User Origin Destination (All Persons)              24596       99.930931
Vulnerable User Contributing Circumstances (All...            24598       99.939057
Vulnerable User Alcohol Suspected Type (All Per...            24599       99.943119
Vulnerable User Drug Suspected Type (All Persons)             24599       99.943119
Vehicle Make                                                    818        3.323447
Vehicle Model                                                  5149       20.919839
Sex                                                            1556        6.321862
Age Group                                                         0        0.000000
Weather Group                                                 22961       93.288100
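
The two missing-value tables report a count and a percentage per column, split into numerical and categorical features. The percentages are consistent with a frame of 24,613 rows (for example, 3 / 24613 × 100 ≈ 0.012189). A minimal sketch of how such a report can be produced is shown below; the helper name `missing_report` and the toy frame are hypothetical and not taken from the project code.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing values per column and express them as a share of rows."""
    missing = df.isnull().sum()
    return pd.DataFrame({
        "Missing Values": missing,
        "Percentage (%)": missing / len(df) * 100,
    })


# Toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "Speed Limit": [25.0, np.nan, 40.0, 30.0],
    "Age": [32.0, 60.0, 36.0, 31.0],
    "Sex": ["F - Female", "M - Male", None, None],
})

# Separate reports for numerical and non-numerical columns.
numeric_report = missing_report(toy.select_dtypes(include="number"))
categorical_report = missing_report(toy.select_dtypes(exclude="number"))
print(numeric_report)
print(categorical_report)
```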

First few rows after cleaning, with Age Group and feature_variable added:
  Crash Number City Town Name  Crash Date Crash Status Crash Time  Crash Year  \
0      5342297         LOWELL  01/01/2024         Open    3:26 AM      2024.0   
1      5342292         LOWELL  01/01/2024         Open   12:48 AM      2024.0   
2      5342292         LOWELL  01/01/2024         Open   12:48 AM      2024.0   
3      5342292         LOWELL  01/01/2024         Open   12:48 AM      2024.0   
4      5342292         LOWELL  01/01/2024         Open   12:48 AM      2024.0   

  Max Injury Severity Reported  Number of Vehicles Police Agency Type  \
0          Possible Injury (C)                 1.0       Local police   
1       No Apparent Injury (O)                 2.0       Local police   
2       No Apparent Injury (O)                 2.0       Local police   
3       No Apparent Injury (O)                 2.0       Local police   
4       No Apparent Injury (O)                 2.0       Local police   

  Age of Driver - Youngest Known  ...   Latitude  Longitude  \
0                          25-34  ...  42.339231 -71.207633   
1                          55-64  ...  42.339231 -71.207633   
2                          55-64  ...  42.339231 -71.207633   
3                          55-64  ...  42.339231 -71.207633   
4                          55-64  ...  42.339231 -71.207633   

  Vehicle Unit Number Vehicle Make Vehicle Model Person Number   Age  \
0                 1.0         HOND          HR-V           1.0  32.0   
1                 1.0         NISS        ALTIMA           1.0  60.0   
2                 2.0         HOND        ACCORD           2.0  36.0   
3                 2.0         HOND        ACCORD           3.0  31.0   
4                 2.0         HOND        ACCORD           4.0  36.0   

          Sex     Age Group feature_variable  
0  F - Female         25-35                1  
1    M - Male  60 and above                0  
2    M - Male         35-50                0  
3    M - Male         25-35                0  
4    M - Male         35-50                0  

[5 rows x 58 columns]
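
The preview above includes two derived columns, `Age Group` and `feature_variable`. The report does not show how they were built, so the sketch below is an assumption: it bins `Age` with `pd.cut` and flags any crash that is not property-damage-only as 1. Only the labels "25-35", "35-50", and "60 and above" appear in the output; the labels "Under 25" and "50-60" and all bin edges are hypothetical.

```python
import numpy as np
import pandas as pd

PDO = "Property damage only (none injured)"

# Toy rows; the values are illustrative only.
df = pd.DataFrame({
    "Age": [32.0, 60.0, 36.0, 31.0, 19.0],
    "Crash Severity": ["Non-fatal injury", PDO, PDO, PDO, "Non-fatal injury"],
})

# Assumed bin edges; intervals are closed on the left, e.g. [25, 35).
bins = [0, 25, 35, 50, 60, np.inf]
labels = ["Under 25", "25-35", "35-50", "50-60", "60 and above"]
df["Age Group"] = pd.cut(df["Age"], bins=bins, labels=labels, right=False)

# Assumed binary target: 1 for any crash with an injury, 0 for property damage only.
df["feature_variable"] = (df["Crash Severity"] != PDO).astype(int)
print(df)
```

With these assumed edges, the ages 32, 60, 36, and 31 fall into the same groups as the first rows of the preview.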

First few rows after scaling the numerical features:
  Crash Number City Town Name  Crash Date Crash Status Crash Time  Crash Year  \
0      5342297         LOWELL  01/01/2024         Open    3:26 AM         0.0   
1      5342292         LOWELL  01/01/2024         Open   12:48 AM         0.0   
2      5342292         LOWELL  01/01/2024         Open   12:48 AM         0.0   
3      5342292         LOWELL  01/01/2024         Open   12:48 AM         0.0   
4      5342292         LOWELL  01/01/2024         Open   12:48 AM         0.0   

  Max Injury Severity Reported  Number of Vehicles Police Agency Type  \
0          Possible Injury (C)           -1.388213       Local police   
1       No Apparent Injury (O)            0.027963       Local police   
2       No Apparent Injury (O)            0.027963       Local police   
3       No Apparent Injury (O)            0.027963       Local police   
4       No Apparent Injury (O)            0.027963       Local police   

  Age of Driver - Youngest Known  ...  Latitude Longitude Vehicle Unit Number  \
0                          25-34  ...  0.325956  0.330683           -0.760059   
1                          55-64  ...  0.325956  0.330683           -0.760059   
2                          55-64  ...  0.325956  0.330683            0.806457   
3                          55-64  ...  0.325956  0.330683            0.806457   
4                          55-64  ...  0.325956  0.330683            0.806457   

  Vehicle Make Vehicle Model Person Number       Age         Sex  \
0         HOND          HR-V     -0.587190 -0.376621  F - Female   
1         NISS        ALTIMA     -0.587190  1.193908    M - Male   
2         HOND        ACCORD      0.041754 -0.152259    M - Male   
3         HOND        ACCORD      0.670698 -0.432711    M - Male   
4         HOND        ACCORD      1.299642 -0.152259    M - Male   

      Age Group feature_variable  
0         25-35                1  
1  60 and above                0  
2         35-50                0  
3         25-35                0  
4         35-50                0  

[5 rows x 58 columns]
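
In the preview above the numerical columns are centered near zero, and the constant `Crash Year` column becomes 0.0, which is what standardization to zero mean and unit variance produces. The sketch below uses scikit-learn's `StandardScaler` on a toy frame; whether the project used this exact class is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame; the values are illustrative only.
df = pd.DataFrame({
    "Crash Year": [2024.0, 2024.0, 2024.0, 2024.0],
    "Number of Vehicles": [1.0, 2.0, 2.0, 3.0],
    "Age": [32.0, 60.0, 36.0, 31.0],
    "Sex": ["F - Female", "M - Male", "M - Male", "M - Male"],
})

# Scale only the numerical columns; categorical columns are left untouched.
numeric_cols = df.select_dtypes(include="number").columns
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
print(df)
```

`StandardScaler` maps a zero-variance column to all zeros rather than dividing by zero, which matches the `Crash Year` values in the preview.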

First few rows after encoding the categorical features as indicator columns:
   Crash Year  Number of Vehicles  MassDOT District  Total Fatalities  \
0         0.0           -1.388213         -0.016167         -0.052805   
1         0.0            0.027963         -0.016167         -0.052805   
2         0.0            0.027963         -0.016167         -0.052805   
3         0.0            0.027963         -0.016167         -0.052805   
4         0.0            0.027963         -0.016167         -0.052805   

   Total Non-Fatal Injuries  Speed Limit       X         Y  Latitude  \
0                  0.905249     0.056513  0.3275  0.321189  0.325956   
1                 -0.447732    -0.343256  0.3275  0.321189  0.325956   
2                 -0.447732    -0.343256  0.3275  0.321189  0.325956   
3                 -0.447732    -0.343256  0.3275  0.321189  0.325956   
4                 -0.447732    -0.343256  0.3275  0.321189  0.325956   

   Longitude  ...  Vehicle Model_1228  Sex_0  Sex_1  Sex_2  Sex_3  \
0   0.330683  ...               False   True  False  False  False   
1   0.330683  ...               False  False   True  False  False   
2   0.330683  ...               False  False   True  False  False   
3   0.330683  ...               False  False   True  False  False   
4   0.330683  ...               False  False   True  False  False   

   Age Group_0  Age Group_1  Age Group_2  Age Group_3  Age Group_4  
0        False         True        False        False        False  
1        False        False        False         True        False  
2        False        False         True        False        False  
3        False         True        False        False        False  
4        False        False         True        False        False  

[5 rows x 36680 columns]
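
The encoded frame has 36,680 columns with names such as `Sex_0` and `Age Group_1`, which suggests that each category was first mapped to an integer code and then expanded into one indicator column per code. The following sketch reproduces that naming pattern on a toy frame; the two-step approach via `cat.codes` and `pd.get_dummies` is an assumption about how the output was produced.

```python
import pandas as pd

# Toy frame; the values are illustrative only.
df = pd.DataFrame({
    "Age": [-0.38, 1.19, -0.15],
    "Sex": ["F - Female", "M - Male", "M - Male"],
    "Age Group": ["25-35", "60 and above", "35-50"],
})
categorical_cols = ["Sex", "Age Group"]

# Step 1 (assumed): replace each category with an integer code.
for col in categorical_cols:
    df[col] = df[col].astype("category").cat.codes

# Step 2: expand each coded column into one indicator column per code.
encoded = pd.get_dummies(df, columns=categorical_cols)
print(encoded.columns.tolist())
```

Because every distinct value gets its own column, high-cardinality fields such as `Crash Number` and `Crash Report IDs` (both visible among the selected feature names) account for much of the width of the encoded frame.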
Selected features: 

 Index(['Total Non-Fatal Injuries', 'Person Number', 'Crash Number_12',
       'Crash Time_361', 'Max Injury Severity Reported_2',
       'Max Injury Severity Reported_5', 'Max Injury Severity Reported_7',
       'Crash Hour_0', 'Driver Contributing Circumstances (All Drivers)_125',
       'Driver Contributing Circumstances (All Drivers)_471',
       ...
       'Crash Report IDs_3929', 'Crash Report IDs_3941',
       'First Harmful Event Location_2', 'First Harmful Event Location_4',
       'Most Harmful Event (All Vehicles)_132',
       'Traffic Control Device Function_1',
       'Traffic Control Device Function_3',
       'Vehicle Sequence of Events (All Vehicles)_542', 'Vehicle Make_269',
       'Age Group_0'],
      dtype='object', length=7410)
Shape of X_train: (19690, 7410)
Shape of X_test: (4923, 7410)
Shape of y_train: (19690,)
Shape of y_test: (4923,)
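
The shapes above correspond to an 80/20 split of 24,613 rows (19,690 for training and 4,923 for testing). A minimal sketch of such a split with scikit-learn's `train_test_split` follows; the toy arrays and `random_state=42` are assumptions, the latter borrowed from the logistic regression settings printed in the results.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix (10 rows, 3 features) and binary target.
X = np.arange(30).reshape(10, 3)
y = np.array([0, 1] * 5)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)
```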
Classifier:  LogisticRegression(max_iter=1000, random_state=42, solver='liblinear')
Accuracy:  1.0
Precision:  1.0
Recall:  0.9990732159406858
F1-Score:  0.9995363931386184
Classifier:  DecisionTreeClassifier()
Accuracy:  0.9999492127983748
Precision:  1.0
Recall:  0.9990732159406858
F1-Score:  0.9995363931386184
Classifier:  RandomForestClassifier()
Accuracy:  1.0
Precision:  1.0
Recall:  0.9990732159406858
F1-Score:  0.9995363931386184
Classifier:  KNeighborsClassifier()
Accuracy:  0.9997968511934993
Precision:  1.0
Recall:  0.9981464318813716
F1-Score:  0.9990723562152134
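
The per-classifier metrics above can be produced with a single loop over the four models. The sketch below uses the classifier settings printed in the output but runs on synthetic data from `make_classification`, so its scores are unrelated to the crash results; the `results` dictionary is a hypothetical convenience for collecting the numbers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the crash features.
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

classifiers = [
    LogisticRegression(max_iter=1000, random_state=42, solver="liblinear"),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    KNeighborsClassifier(),
]

results = {}
for clf in classifiers:
    # Fit on the training split, then score predictions on the test split.
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    results[type(clf).__name__] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }
    print("Classifier: ", clf)
    for name, value in results[type(clf).__name__].items():
        print(f"{name}: {value:.4f}")
```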

Logistic Regression Accuracy: 1.0
Decision Tree Accuracy: 1.0
Random Forest Accuracy: 1.0
KNN Accuracy: 1.0